Abstract

We present a method for calibrating the Ensemble of Exemplar SVMs
model. Unlike the standard approach, which calibrates each SVM
independently, our method optimizes their joint performance as an
ensemble. We formulate joint calibration as a constrained optimization
problem and devise an efficient optimization algorithm to find its
global optimum. The algorithm dynamically discards parts of the
solution space that cannot contain the optimum early on, making the
optimization computationally feasible. We experiment with EE-SVM
trained on state-of-the-art CNN descriptors. Results on the ILSVRC 2014
and PASCAL VOC 2007 datasets show that (i) our joint calibration
procedure outperforms independent calibration on the task of
classifying windows as belonging to an object class or not; and (ii)
this improved window classifier leads to better performance on the
object detection task.

1. Introduction 

The Ensemble of Exemplar SVMs [1] (EE-SVM) is a powerful
non-parametric approach to object detection. It is widely used [2-12]
because it explicitly associates a training example to each object it
detects in a test image. This enables transferring meta-data such as
segmentation masks [1, ], 3D models [1], viewpoints [12], GPS
locations [1 ] and part-level regularization [A]. Furthermore, EE-SVM
can also be used for discovering object parts [3, 5], scene
classification [3, ], object classification [6], image parsing [7],
image matching [2], automatic image annotation [11] and 3D object
detection [8].

An EE-SVM is a large collection of linear SVM classifiers, each
trained from one positive example and many negative ones (an E-SVM).
At test time each window is scored by all E-SVMs, and the highest score
is assigned to the window. Because of this max operation, it is
necessary to calibrate the E-SVMs to make their scores comparable. A
common procedure is to calibrate each SVM independently, by fitting a
logistic sigmoid to its output on a validation set [1]. Such
independent calibration, however, does not take into account that the
final score is the max over many E-SVMs. Moreover, calibrating one
E-SVM in isolation requires choosing which positive training samples
it should score high and which ones it can afford to score low. Such a
prior association of positive training samples to E-SVMs is arbitrary,
as there is no predefined notion of how much and in which way a
particular E-SVM should generalize. What truly matters is the interplay
between all E-SVMs through the max operation.

In this paper we present a joint calibration procedure that takes into
account the max operation. We calibrate all E-SVMs at the same time by
optimizing their joint performance after the max. Our method finds a
threshold for each E-SVM, so that (i) all positive windows are scored
positively by at least one E-SVM, and (ii) the number of negative
windows scored positively by any E-SVM is minimized. The first
criterion ensures that there are no positive windows scored negatively
after the max, while the second criterion minimizes the number of false
positives.

We formalize these two criteria in a well-defined constrained
optimization problem. The first requirement is formalized in its
constraints, while the second comes in as a loss function to be
minimized. Each threshold defines which training samples the respective
E-SVM scores positively. By lowering a threshold we cover more
positives and thereby satisfy more constraints, but we also include
more negatives and therefore suffer a greater loss. Any positive sample
can potentially be covered by any E-SVM, but at a different loss. This
combinatorial nature of the problem makes it difficult to find the
global optimum. We propose an efficient, globally optimal optimization
technique. By exploiting the structure of the problem we are able to
identify areas of the solution space that cannot contain the optimal
solution and discard them early on. Our globally optimal algorithm is
able to calibrate a few hundred E-SVMs quickly. In order to solve
larger problems with thousands of E-SVMs, we present a simple
modification of our exact algorithm that delivers high quality
approximate solutions.

The rest of the paper is organized as follows. We start by reviewing
related work in sec. 2. Sec. 3 introduces the formulation of our
optimization problem, while sec. 4 presents our algorithm for
efficiently finding the globally optimal solution as well as its
approximation. We train EE-SVM on state-of-the-art CNN descriptors [13]
and present experiments on 10 classes of the ILSVRC 2014 dataset [14]
and on all 20 classes of PASCAL VOC 2007 [15] in sec. 5. These
experiments show that (i) our joint calibration procedure outperforms
standard independent sigmoid calibration [1] on the task of classifying
windows as belonging to an object class or not; and (ii) this
translates to better object detection performance. Finally, we conclude
in sec. 6.

2. Related Work 

In the machine learning literature, classifier calibration has been
considered in the context of deriving probabilistic output for binary
classifiers [16-19] or multi-class classification [10, 2C]. Multi-class
problems are often cast as a series of binary problems (e.g. 1-vs-all)
and [1, 20, 21] showed that calibrating these binary classifiers often
leads to improved prediction.

The two most popular methods for calibrating binary classifiers are
Platt scaling [16] and isotonic regression [17]. They both fit a
monotonic function of the classifier score to the empirical label
probability, obtaining an estimate of the conditional probability of a
class label given the score. Platt scaling [16] uses a simple sigmoid
function, while [17] employs a more flexible isotonic regression. In
computer vision, Platt scaling is the most popular calibration
tool [1, 5, 22]. We compare to both methods in sec. 5.3.

All these works [16-19] assume that the set of positive training
samples for each classifier is fixed and given beforehand, even if
small. In contrast, in the EE-SVM model, any positive sample can
potentially be associated with any E-SVM. In the original EE-SVM
paper [1] this was resolved in a greedy fashion, where each E-SVM was
calibrated independently. The association of a positive sample to an
E-SVM was resolved by comparing its uncalibrated E-SVM score to a fixed
threshold. Instead, we calibrate E-SVMs and associate positive samples
with them jointly over all positives and all E-SVMs. Our joint
formulation (sec. 3) ensures that every positive is associated with at
least one E-SVM, while the total number of false positives is
minimized. As an additional benefit, this enables removing up to 25% of
redundant E-SVMs that are not associated with any positives after the
global optimum is found.

Two interesting exceptions to the classic EE-SVM calibration
procedure [1] were presented recently [10, 12]. Gronat et al. [10]
learn a per-location classifier for visual place recognition,
while [12] learns exemplar-based 3D “chair” representations. Both works
employ a calibration strategy based purely on negative samples,
sidestepping the association of positive samples to E-SVMs. For
completeness, we compare to [L ] in sec. 5.3. All techniques reviewed
above calibrate each E-SVM independently.

3. Joint calibration formulation 

In many object detection pipelines [13, 23, 24] a single linear
classifier $w \in \mathbb{R}^d$ is applied to all $K$ candidate windows
$\{x_i\}_{i=1}^{K}$ in an image, where $x \in \mathbb{R}^d$ is the
window descriptor. The windows are then ranked according to the
classifier score $w \cdot x$. An EE-SVM, instead, contains $E$
classifiers: $\{w_j\}_{j=1}^{E}$. The score of a window $x$ is defined
as the highest score among all classifiers applied to it:

$$S(x) = \max_j \, (w_j \cdot x) \tag{1}$$

Figure 1: Example of window proposals used in our calibration
technique. $\mathcal{P}$ is the set of positive windows (□) and
$\mathcal{N}$ is the set of negative windows (□) in the training set.
Finally, ( ) indicates E-SVMs ground-truth bounding-box. A window is
positive if it has an intersection-over-union > 0.5 with a ground-truth
box [15].
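As a minimal sketch of the max operation in (1), assuming descriptors and E-SVM weight vectors are plain Python lists (the names `ensemble_score`, `W` and `x` are illustrative and not from the paper), the ensemble score and the exemplar responsible for it could be computed as follows:

```python
def ensemble_score(W, x):
    """Return (score, index) of the best-scoring E-SVM for window descriptor x.

    W is a list of E weight vectors, x is one descriptor; both are toy
    stand-ins for the 4096-dimensional CNN descriptors used in the paper.
    """
    scores = [sum(wi * xi for wi, xi in zip(w, x)) for w in W]
    best = max(range(len(W)), key=lambda j: scores[j])
    return scores[best], best


W = [[1.0, 0.0], [0.0, 2.0]]            # two toy E-SVMs
first = ensemble_score(W, [3.0, 1.0])   # E-SVM 0 wins with score 3.0
second = ensemble_score(W, [0.5, 1.0])  # E-SVM 1 wins with score 2.0
```

Returning the index of the winning E-SVM alongside the score is what makes annotation transfer possible: the index identifies the training exemplar associated with the detection.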

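Our goal is to find a threshold $\theta_j$ for each E-SVM $e_j$ such
that (i) all positive windows are scored positively by at least one
E-SVM, and (ii) the number of negative windows scored positively by any
E-SVM is minimized. A window $x$ is scored positively by E-SVM $e_j$ if
$w_j \cdot x - \theta_j > 0$. We formalize these criteria in an
optimization problem:

$$
\begin{aligned}
\min_{\Theta = \{\theta_j\}_{j=1}^{E}} \quad & \mathcal{L}(\Theta) = \sum_{x \in \mathcal{N}} \mathbb{1}\big[\max_j \, (w_j \cdot x - \theta_j)\big] \\
\text{s.t.} \quad & \mathbb{1}\big[\max_j \, (w_j \cdot x - \theta_j)\big] > 0, \quad \forall x \in \mathcal{P}
\end{aligned}
\tag{2}
$$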

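where $\mathbb{1}[\cdot]$ is the indicator function (equal to 1 if its
argument is positive and 0 otherwise), and $\mathcal{P}$ and
$\mathcal{N}$ are the sets of positive and negative windows in the
training set (fig. 1). We refer to the top term as the loss function
$\mathcal{L}(\Theta)$ and the bottom terms as the constraints.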

Calibration is performed by adjusting the thresholds $\Theta$. Given a
configuration of thresholds $\Theta = [\theta_1, \theta_2, \dots,
\theta_E]$, the loss $\mathcal{L}(\Theta)$ counts the number of
negative windows scored positively after the max operation. Each
constraint $i$ ensures that a positive window $x_i$ is scored
positively by at least one E-SVM. We refer to a configuration $\Theta$
satisfying all the constraints as a feasible solution.
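The loss and the feasibility test of (2) can be sketched in a few lines. This is an illustrative sketch only: it assumes the scores $w_j \cdot x$ have been precomputed into nested lists (`pos_scores[i][j]` and `neg_scores[n][j]` are hypothetical names), and the toy numbers are made up.

```python
def scored_positively(scores_row, thetas):
    """True if at least one E-SVM scores the window above its threshold."""
    return any(s - t > 0 for s, t in zip(scores_row, thetas))


def loss(neg_scores, thetas):
    """Number of negative windows scored positively after the max."""
    return sum(1 for row in neg_scores if scored_positively(row, thetas))


def is_feasible(pos_scores, thetas):
    """True if every positive window is covered by at least one E-SVM."""
    return all(scored_positively(row, thetas) for row in pos_scores)


# Toy problem with two E-SVMs, two positives and three negatives.
pos_scores = [[2.0, 0.1], [0.2, 1.5]]
neg_scores = [[1.0, 0.0], [0.0, 0.3], [-1.0, -1.0]]

tight = [1.5, 1.0]   # covers both positives without any false positive
loose = [0.5, 0.2]   # still feasible, but admits two negatives
```

The example shows the trade-off described above: both configurations are feasible, but the looser one pays for it with a higher loss.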


4. Globally optimal and efficient solution 

In this section we develop a computationally efficient al¬ 
gorithm to find the global optimal solution of (2). We start 
in sec. 4.1 by analysing the space of all possible solutions of 
(2). In sec. 4.2 we then introduce a data structure to repre¬ 
sent this space, and finally in sec. 4.3 we present an efficient 
algorithm to search this data structure for the globally opti¬ 
mal solution. 
Figure 2: Illustration of our joint calibration algorithm. (left) shows
the window scores of the two E-SVMs $e_1$ and $e_2$. (middle) shows the
candidate thresholds for these two E-SVMs. These are
$[\theta_1^1, \theta_1^2]$ and $[\theta_2^1, \theta_2^2, \theta_2^3]$,
respectively. Finally, (right) shows the tree representing the space of
all possible solutions. Note how the only feasible threshold
configurations are those in the leaves.


4.1. Space of candidate thresholds 

At first sight, (2) appears to be a continuous optimization problem
where each threshold can take any value in $\mathbb{R}$. However, since
E-SVMs are evaluated only on a finite set of training windows, there
exists an infinite set of equivalent thresholds leading to the same
loss. For this reason, (2) is in practice a discrete optimization
problem.

Fig. 3 shows an example. Since each constraint in (2) evaluates one
positive window, an E-SVM needs at most $P + 1$ thresholds to satisfy
each of them (fig. 2), where $P = |\mathcal{P}|$. Furthermore,
considering a threshold between two positive samples is not necessary,
because the loss only changes when new negative samples are scored
positively. The only thresholds worth considering are those between the
score of a negative sample and a positive (not the reverse, fig. 3). We
denote the set of candidate thresholds for an E-SVM $e_j$ as
$[\theta_j^1, \theta_j^2, \dots, \theta_j^{M_j}]$, ordered from the
tightest to the lowest, where $1 \le M_j \le P + 1$ and
$\theta_j^a > \theta_j^b$ for $\forall a, b : 1 \le a < b \le M_j$. By
construction, the lowest threshold $\theta_j^{M_j}$ satisfies all the
constraints in (2).
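The enumeration of candidate thresholds for a single E-SVM can be sketched as follows. This is a hedged illustration, not the authors' code: the function name, the `margin` parameter used for the two boundary thresholds, and the assumption that no positive and negative share exactly the same score are all ours. Thresholds are returned from tightest to lowest.

```python
def candidate_thresholds(pos, neg, margin=1.0):
    """Candidate thresholds for one E-SVM, tightest (highest) first.

    pos / neg are the E-SVM's scores on positive / negative windows.
    A threshold is placed at the mean of two adjacent scores whenever a
    negative lies directly below a positive. Two boundary thresholds
    (an assumption of this sketch) handle a top-scoring negative and a
    bottom-scoring positive.
    """
    samples = sorted([(s, True) for s in pos] + [(s, False) for s in neg])
    cands = []
    if samples[0][1]:                      # lowest score is a positive
        cands.append(samples[0][0] - margin)
    for (lo, lo_pos), (hi, hi_pos) in zip(samples, samples[1:]):
        if not lo_pos and hi_pos:          # negative directly below a positive
            cands.append((lo + hi) / 2.0)
    if not samples[-1][1]:                 # highest score is a negative
        cands.append(samples[-1][0] + margin)
    return sorted(cands, reverse=True)


interleaved = candidate_thresholds(pos=[1.0, 0.5], neg=[0.75, 0.25])
swapped = candidate_thresholds(pos=[0.5], neg=[1.0])
```

With one positive and one higher-scoring negative, the sketch yields two candidates, matching the $P + 1$ upper bound stated above.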

To conclude, the number of candidate thresholds for an individual E-SVM
is relatively small (at most $P + 1$), but the joint space of E-SVM
thresholds is nonetheless huge. In the worst-case scenario (all E-SVMs
have $P$ candidate thresholds), there are $P^E$ threshold
configurations, many of which are not a feasible solution to (2). In
the next section we present a data structure to enumerate all these
configurations and highlight the feasible solutions.

Figure 3: Candidate thresholds, given the scores on positive (+) and
negative (-) windows. The only thresholds worth considering according
to (2) are the ones between a negative and positive window. Of all
equivalent thresholds between two window scores, we consider only the
mean of the two scores.

4.2. Exhaustive search tree 

We represent the space of all possible solutions as a search tree
(fig. 2).

Definition 1. (Search Tree) Our search tree $T$ is a perfect $k$-ary
tree: a rooted tree where every internal node has exactly $k$ children
and all leaves are at the same depth $h$. Each node $\eta$ contains a
configuration $\Theta = [\theta_1, \theta_2, \dots, \theta_E]$ of
thresholds.

A configuration $\Theta$ at node $\eta$ is used to compute the loss
$\mathcal{L}(\Theta)$ by counting how many negative windows are scored
positively according to (2). The root node has the configuration
$\Theta = [\theta_1^1, \theta_2^1, \dots, \theta_E^1]$ of the tightest
threshold for each E-SVM. We denote the set of false positives at a
node $\eta$ as $\mathcal{F}_\eta$, and
$\mathcal{L}(\Theta_\eta) = |\mathcal{F}_\eta|$. Note how the root has
$\mathcal{L} = 0$.

In our representation we have $h = P$ and $k = E$. Each level $l$ of
the tree corresponds to a positive window $p_l$, and each edge
corresponds to an E-SVM $e_j$. An edge $e_j$ indicates that $e_j$ is
responsible for $p_l$ and it should score it positively, hence
satisfying one constraint of (2). Given an E-SVM $e_j$ and its current
threshold $\theta_j$, the edge lowers the threshold so that
$w_j \cdot x_l - \theta_j > 0$ (if this condition is already satisfied
then the threshold does not change). Lowering the threshold might
increase the loss, but not necessarily. Lowering the threshold will
make E-SVM $e_j$ score positively some negative windows, but these
affect the loss only if they were not already scored positively by
another E-SVM.

The deeper the level, the more constraints of (2) are satisfied. By
construction, the configuration of thresholds at a leaf satisfies all
constraints, and the set of all leaves represents the set of all
feasible solutions. Also note that the number of false positives always
increases (or remains the same) along a path from the root to a leaf:
given a node $\eta$ and any child $\eta'$, we have that
$\mathcal{L}(\Theta_\eta) \le \mathcal{L}(\Theta_{\eta'})$.
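Following an edge of the tree can be sketched as a small function. The sketch below assumes that the candidate thresholds of each E-SVM are available as a list sorted from tightest to lowest; `follow_edge` and its arguments are illustrative names, not part of the paper.

```python
def follow_edge(thetas, j, pos_score, candidates_j):
    """Configuration of the child reached through edge e_j.

    thetas:       current threshold of every E-SVM (parent node)
    j:            index of the E-SVM made responsible for the positive
    pos_score:    score of E-SVM j on the positive of the next level
    candidates_j: candidate thresholds of E-SVM j, tightest first
    """
    if pos_score - thetas[j] > 0:
        return list(thetas)               # constraint already satisfied
    child = list(thetas)
    # Highest candidate that still scores the positive window positively.
    child[j] = max(t for t in candidates_j if pos_score - t > 0)
    return child


parent = [0.875, 2.0]
lowered = follow_edge(parent, 0, 0.5, [0.875, 0.375])
unchanged = follow_edge(parent, 0, 1.0, [0.875, 0.375])
```

Because an edge can only lower a threshold or leave it untouched, the set of negatives scored positively can only grow along a path, which is the monotonicity property stated above.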

















































4.3. Efficient search 

In this section we find the globally optimal solution to (2) by
searching the tree. Exhaustive search is computationally prohibitive
even for small problems with few E-SVMs and positive windows, as the
total number of nodes in our tree is $(E^{P+1} - 1)/(E - 1)$ (with $E$
the number of E-SVMs and $P$ the number of positive windows).
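The node count is the sum of a geometric series, one term $E^l$ per tree level $l = 0, \dots, P$. A quick check with exact integer arithmetic (the function names are ours, for illustration only):

```python
def tree_nodes(E, P):
    """Number of nodes of a perfect E-ary tree of depth P, level by level."""
    return sum(E ** level for level in range(P + 1))


def tree_nodes_closed_form(E, P):
    """Closed form (E^(P+1) - 1) / (E - 1), valid for E > 1."""
    return (E ** (P + 1) - 1) // (E - 1)


small = tree_nodes(3, 4)   # 1 + 3 + 9 + 27 + 81
# Number of decimal digits of the node count for 30 E-SVMs and 500 positives.
digits = len(str(tree_nodes_closed_form(30, 500)))
```

Even at the lower end of the problem sizes reported in sec. 5.3 (30 E-SVMs, 500 positive windows) the count has hundreds of digits, which is why pruning is essential.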

The key to our efficient algorithm is to prune the tree by iteratively
removing subtrees which cannot contain the globally optimal solution.
In the following paragraphs we present several observations that enable
us to drastically reduce the space of solutions to consider. The last
paragraph presents the actual search algorithm.

Observation 1. (Pruning by bound). If $\eta$ is a leaf and $\eta'$ is a
node not on the path from the root to $\eta$, and
$\mathcal{L}(\Theta_\eta) \le \mathcal{L}(\Theta_{\eta'})$, then the
subtree rooted at $\eta'$ cannot contain a better solution than $\eta$
and can be discarded.

The key intuition is that the loss can only increase (or stay the same)
with depth. The observation leads to two cases. First, if
$\mathcal{L}(\Theta_\eta) < \mathcal{L}(\Theta_{\eta'})$, then $\eta'$
cannot lead to an optimal solution since its loss is already higher
than that of an already found feasible solution. Second, if
$\mathcal{L}(\Theta_\eta) = \mathcal{L}(\Theta_{\eta'})$, $\eta'$ could
lead to an optimal solution, but it would be equivalent to the one in
$\eta$. In both cases we can discard the subtree rooted at $\eta'$, as
we are interested in finding only one optimal solution.

Consider the example in fig. 4. Since
$\mathcal{L}(\Theta_{\eta_3}) = 5 < \mathcal{L}(\Theta_{\eta_2}) = 6$,
the subtree rooted at $\eta_2$ cannot contain an optimal solution: any
solution in it has a loss $\ge 6$, which is higher than the feasible
solution in $\eta_3$.

Observation 2. (Pruning by equivalence). If two nodes $\eta$ and
$\eta'$ have the same parent and
$\mathcal{F}_\eta = \mathcal{F}_{\eta'}$, then they are equivalent and
only one subtree needs to be searched.

The key intuition is that the loss in (2) only increases when new
negative windows are scored positively. Consider the example in fig. 5,
where $\mathcal{F}_{\eta_1} = \mathcal{F}_{\eta_2}$ and $\eta_1$ and
$\eta_2$ have the same parent (i.e., they satisfy the same constraints
of (2)). $\eta_1$ has $\Theta = [\theta_1^2, \theta_2^1]$ because the
edge from $\eta_0$ to $\eta_1$ adjusted the threshold of $e_1$. Note,
however, that this configuration can be changed into
$[\theta_1^1, \theta_2^2]$ without increasing the loss, as they are
equivalent solutions. Because of this equivalence, both subtrees lead
to equivalent feasible solutions and only one of them needs to be
searched.

Observation 3. (Reducing tree depth). Given the root configuration
$\Theta = [\theta_1^1, \dots, \theta_E^1]$, there might exist some
$x \in \mathcal{P} : \max_j \, (w_j \cdot x - \theta_j^1) > 0$. Since
$\Theta$ already satisfies the constraint for these positives at zero
cost, these can be eliminated right away, reducing the depth of the
tree.

Consider the example in fig. 2. Initially,
$\Theta = [\theta_1^1, \theta_2^1]$. This configuration already scores
$p_1$ positively. Whatever optimal solution $\Theta$ the tree
retrieves, $p_1$ will always be scored positively by at least E-SVM
$e_1$. Hence, we can eliminate it from the tree to reduce its depth.

Figure 4: Example of pruning by bound. Since
$\mathcal{L}(\Theta_{\eta_3}) = 5 < \mathcal{L}(\Theta_{\eta_2}) = 6$,
the subtree rooted at $\eta_2$ cannot contain an optimal solution.

Figure 5: Example of pruning by equivalence. Since the two nodes
$\eta_1$ and $\eta_2$ have the same parent and
$\mathcal{F}_{\eta_1} = \mathcal{F}_{\eta_2}$, they are equivalent and
only one subtree needs to be searched.

Figure 6: Efficient pruning by bound can be achieved by sorting the
positive windows by decreasing difficulty.

Observation 4. (Order of positive windows). By sorting the positive
windows by decreasing difficulty, pruning by bound can discard larger
subtrees. The difficulty of a positive window $x$ is measured as
$\min_j \delta(e_j, x)$, where $\delta(e_j, x)$ counts how many false
positives $e_j$ produces by scoring the positive sample
$x \in \mathcal{P}$ positively.

The key intuition is that it is better to prune subtrees rooted at the
higher levels of the tree, as they contain more nodes. This can be
achieved by placing difficult positive windows up in the higher levels.
Some positive windows lead to constraints that are intrinsically more
difficult to satisfy than others, as any E-SVM asked to satisfy them
would score many negatives positively as well, and hence incur a large
loss. By sorting the tree levels according to the difficulty of
positive windows, it is likely that the loss of many high-level
configurations is higher than that of a previously found feasible
solution, and therefore they can be pruned (observation 1).
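The difficulty measure of observation 4 can be sketched as below. The sketch assumes precomputed score lists (`pos_scores[i][j]`, `neg_scores[n][j]`, hypothetical names) and approximates $\delta(e_j, x)$ as the number of negatives that E-SVM $e_j$ scores strictly higher than $x$, which is what lowering its threshold just below the score of $x$ would admit when there are no ties.

```python
def delta(neg_scores, j, pos_score):
    """False positives E-SVM j produces when it must score pos_score positively."""
    return sum(1 for row in neg_scores if row[j] > pos_score)


def order_by_difficulty(pos_scores, neg_scores):
    """Indices of positive windows, most difficult first."""
    num_esvms = len(pos_scores[0])

    def difficulty(i):
        return min(delta(neg_scores, j, pos_scores[i][j])
                   for j in range(num_esvms))

    return sorted(range(len(pos_scores)), key=difficulty, reverse=True)


pos_scores = [[2.0, 0.1], [0.2, 0.4]]
neg_scores = [[1.0, 0.0], [0.0, 0.3], [0.5, 0.5]]
order = order_by_difficulty(pos_scores, neg_scores)
```

In this toy case the second positive costs at least one false positive whichever E-SVM covers it, so it is placed at the top level of the tree.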

Consider the example in fig. 6. Tree (a) evaluates first $p_1$ and then
$p_2$, while (b) does the opposite. Tree (a) cannot be pruned by
observation 1, but tree (b) can.

In sec. 5.5 we evaluate experimentally how effective the 
above pruning techniques are on various real EE-SVM cal¬ 
ibration problems. 

Search algorithm. We present here a depth-first search algorithm to
efficiently find the globally optimal solution, based on the above
observations (Algo. 1).

The algorithm works as follows. The initial configuration of
thresholds $\Theta$ is the one from the root node (line 1). As
preprocessing, the algorithm starts by reducing the tree depth using
observation 3 (line 2) and re-ordering the positive windows using
observation 4 (line 3). In the first step, it does a depth-first search
until it reaches a leaf $\eta$ and finds a first feasible solution
$\Theta$ (line 4). During this traversal, when going down a level, the
algorithm always chooses the edge leading to the smallest loss. Next,
the algorithm continues by going up (line 6) and down (line 10) the
tree. When visiting a node, the algorithm tries to prune as many
children subtrees as it can using observation 1 (line 8) and
observation 2 (line 9). When the algorithm reaches a leaf, this must
contain a better solution than the current one $\Theta$ (in terms of
the loss (2)). Hence, it updates $\Theta$ (line 13). The algorithm
continues until all nodes have been visited or pruned. The final
$\Theta$ is the global optimum of (2).
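The search can be sketched compactly in recursive form. This is an illustrative re-implementation under our own assumptions, not the iterative procedure of Algo. 1: scores are given as nested lists, each threshold is placed exactly at a negative score (equivalent to the midpoint when there are no ties), and all function names are hypothetical.

```python
import math


def joint_calibrate(pos_scores, neg_scores):
    """Branch-and-bound sketch: returns (thresholds, number of false positives)."""
    num_esvms = len(neg_scores[0])

    def covered(row, thetas):
        return any(s > t for s, t in zip(row, thetas))

    def false_positives(thetas):
        return frozenset(n for n, row in enumerate(neg_scores) if covered(row, thetas))

    def lower(thetas, j, s):
        """Lower threshold j just enough to score a positive with score s."""
        if s > thetas[j]:
            return thetas
        below = [row[j] for row in neg_scores if row[j] < s]
        child = list(thetas)
        child[j] = max(below) if below else -math.inf
        return child

    # Root: tightest thresholds, zero false positives.
    root = [max(row[j] for row in neg_scores) for j in range(num_esvms)]
    # Observation 3: drop positives already covered at the root.
    todo = [row for row in pos_scores if not covered(row, root)]
    # Observation 4: most difficult positives first.
    todo.sort(key=lambda row: min(len(false_positives(lower(root, j, row[j])))
                                  for j in range(num_esvms)), reverse=True)

    best = {"loss": math.inf, "thetas": None}

    def search(level, thetas):
        if level == len(todo):            # leaf: only reached if it improves
            best["loss"] = len(false_positives(thetas))
            best["thetas"] = list(thetas)
            return
        seen, children = set(), []
        for j in range(num_esvms):
            child = lower(thetas, j, todo[level][j])
            fps = false_positives(child)
            if fps not in seen:           # observation 2: skip equivalent siblings
                seen.add(fps)
                children.append((len(fps), child))
        children.sort(key=lambda c: c[0])  # cheapest edge first
        for cost, child in children:
            if cost >= best["loss"]:      # observation 1: prune by bound
                break
            search(level + 1, child)

    search(0, root)
    return best["thetas"], best["loss"]


pos_scores = [[2.0, 0.1], [0.2, 1.5], [0.6, 0.4], [0.4, 0.2]]
neg_scores = [[1.0, 0.0], [0.0, 0.3], [0.5, 0.5]]
thetas, n_false_pos = joint_calibrate(pos_scores, neg_scores)
```

In this toy problem the first two positives are covered at the root for free, and the remaining two are best covered by the same E-SVM, giving two false positives; assigning them to different E-SVMs would cost three.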

4.4. Approximate search 

Above we presented an efficient algorithm that guarantees global
optimality. If we relax the global optimality requirement, we can
improve efficiency even further. Our method follows a depth-first
search and as soon as it reaches a leaf, it finds a feasible solution.
This happens periodically during the execution of the algorithm, as
better and better leaves are found while the tree is searched. This
behaviour makes our method an any-time algorithm [25, 26]. After a
short period required to reach the first leaf, we can terminate it at
any time and it will return the best feasible solution it has found so
far (although not necessarily the globally optimal one). This simple
observation enables us to employ our method, essentially unchanged, to
find approximate solutions as well.
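The any-time behaviour can be illustrated with a small consumer that stops after a fixed budget. The sketch assumes the search is exposed as an iterator that yields a `(loss, thresholds)` pair each time it reaches a leaf; this interface, the function name and the budget expressed in leaves (rather than wall-clock time) are assumptions made for illustration.

```python
def best_within_budget(improving_solutions, max_leaves):
    """Consume leaves in the order they are found and stop at the budget.

    Each leaf reached by the pruned search improves on the previous one,
    so the last solution seen is the best one found so far.
    """
    best = None
    for count, solution in enumerate(improving_solutions, start=1):
        best = solution
        if count >= max_leaves:
            break
    return best


def toy_search():
    """Stand-in for the tree search: yields successively better leaves."""
    yield 5, [0.9, 0.4]
    yield 3, [0.7, 0.6]
    yield 2, [0.7, 0.8]


approximate = best_within_budget(toy_search(), max_leaves=2)
exact = best_within_budget(toy_search(), max_leaves=10)
```

With a generous budget the consumer simply runs the search to completion and returns the global optimum; with a tight one it returns a feasible but possibly sub-optimal configuration.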

Algorithm 1 Our Efficient Search Algorithm

Input: search tree $T$
Output: global optimal $\Theta$

1: $\Theta \leftarrow [\theta_1^1, \theta_2^1, \dots, \theta_E^1]$
2: $T \leftarrow$ Reduce Tree Depth($T$, $\Theta$)
3: $T \leftarrow$ Reorder Positive Windows($T$, $\Theta$)
4: $\eta, \Theta, \mathcal{L} \leftarrow$ Depth First Search($T$, $\Theta$)
5: while $T$ not fully searched do
6:     $\eta \leftarrow$ Go Up One Level($T$, $\eta$)
7:     while $\neg$isLeaf($\eta$) $\wedge$ $\neg$hasChild($\eta$) do
8:         $T \leftarrow$ Prune By Bound($T$, $\eta$, $\mathcal{L}$)
9:         $T \leftarrow$ Prune By Equivalence($T$, $\eta$, $\Theta$)
10:        $\eta \leftarrow$ Go Down One Level($T$, $\eta$)
11:    end while
12:    if isLeaf($\eta$) then
13:        $\Theta, \mathcal{L} \leftarrow$ Get Solution($\eta$)
14:    end if
15: end while
16: return $\Theta$

5. Experiments
5.1. Datasets

We present experiments on ILSVRC2014 [14] (sec. 5.3, 5.4) and PASCAL
VOC 2007 [15] (sec. 5.4). ILSVRC2014 [14] contains 200 classes
annotated by bounding-boxes. In our experiments we randomly sampled 10
classes: airplane, bagel, baseball, bear, butterfly, koala, ladle,
printer, sheep and violin. Following [] ] we consider three disjoint
subsets of the data: train, val1 and val2. Since annotations for the
test set are not released, we measure performance on val2. We use val1
and train for training. In total, these sets contain >80k images.

PASCAL VOC 2007 [15] contains 20 classes annotated by bounding-boxes.
In our experiments we evaluate on all 20 classes. We use the subset
trainval for training and we measure performance on test. In total,
these sets contain about 10k images.

5.2. Settings 

Object proposals and features. We generate class-independent object
proposals using [27]. Given an image, this produces a small set of a
few thousand windows likely to cover all objects. We then extract
state-of-the-art CNN descriptors of 4096 dimensions for these
proposals, as in [13]. These descriptors are the output of a
convolutional neural network (CNN) initially trained for image
classification [28, 29] and then fine-tuned for object detection [1 ]
(on val1 of ILSVRC2014, or on trainval of PASCAL VOC 2007).

EE-SVM. We learn a separate window classifier $e$ for each instance of
an object in the training set. We set $C = 10^{-4}$ and we mine hard
negatives from 2000 random training images. In our experiments we
observed that mining more images did not bring a significant
improvement.

Calibration data. For each class we define $\mathcal{P}$ as the set of
all positive training windows. A window is considered positive if it
has intersection-over-union (IoU) [15] > 0.5 with a ground-truth
bounding-box of that class. Moreover, $\mathcal{N}$ contains negative
windows that overlap < 0.2. All calibration methods below are trained
from this data.
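The construction of the calibration sets can be sketched with a standard intersection-over-union computation. The box format `(x1, y1, x2, y2)` with continuous coordinates, the function names, and the label "ignored" for windows that fall in neither set are assumptions of this sketch.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def label_window(window, gt_boxes):
    """Assign a window to the positive set, the negative set, or neither."""
    best = max((iou(window, g) for g in gt_boxes), default=0.0)
    if best > 0.5:
        return "positive"
    if best < 0.2:
        return "negative"
    return "ignored"


overlap = iou((0, 0, 2, 2), (1, 0, 3, 2))   # intersection 2, union 6
```

Windows with an overlap between 0.2 and 0.5 belong to neither set under the thresholds quoted above, which keeps ambiguous windows out of the calibration data.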






ILSVRC 2014 - trained on Val1           Airplane  Bagel     Baseball  Bear      Butterfly Koala     Ladle     Printer   Sheep     Violin    mean
Recall                                  94.1      90.1      97.9      93.1      97.5      96.6      79.5      88.5      98.9      85.3      92.2
EE-SVM no calibration                   180124    171966    499056    33664     163727    80145     250602    221117    308458    37519     195k
EE-SVM indep. sigmoid calibration [1]   119552    42165     234734    77099     53986     13017     56616     88390     86507     33266     80k
EE-SVM joint calibration                65182     28460     180658    22694     53927     10746     87513     36573     55923     32570     57k
EE-SVM joint calibration w/ sigmoid     64996     28140     168129    22876     53867     10722     87329     35803     50064     33899     55k
EE-SVM indep. isotonic regression [20]  54173     23625     268302    16424     43580     13507     60285     55129     76997     13814     63k
EE-SVM indep. affine calibration [12]   51905     24507     224003    18978     37483     9947      67288     86757     110909    18218     65k
Single Linear-SVM (R-CNN) [13]          102676    335122    711185    109480    63849     305931    322332    469777    979131    121050    341k

Table 1: Window classification - False positives at test recall.
Results on a subset of ILSVRC 2014 Val2 (all positive windows and one
million randomly sampled negative ones). We use the optimal thresholds
found by our algorithm (sec. 4) to compute recall on Val2. This is the
percentage of positive windows scored positively by our jointly
calibrated EE-SVM. The table entries show the number of false positives
produced in order to reach that recall level. Each row corresponds to a
different method (ours are marked 'joint calibration').


ILSVRC 2014 - trained on Val1           Airplane  Bagel     Baseball  Bear      Butterfly Koala     Ladle     Printer   Sheep     Violin    mAP
EE-SVM indep. sigmoid calibration [1]   42.8      39.7      63.3      58.7      60.8      58.2      4.5       29.0      49.5      20.3      42.7
EE-SVM joint calibration w/ sigmoid     43.3      40.1      66.5      60.6      63.9      61.1      5.0       31.6      55.1      22.9      45.0
EE-SVM indep. isotonic regression [20]  44.6      42.4      61.4      59.3      63.8      63.2      5.7       22.5      50.9      23.2      43.7
EE-SVM indep. affine calibration [12]   45.6      42.0      64.1      59.2      62.6      62.2      6.3       22.9      50.4      25.9      44.1
Single Linear-SVM (R-CNN) [13]          47.9      36.9      65.0      60.9      66.7      63.4      5.4       24.4      50.6      19.1      44.0

Table 2: Window classification - Average precision. Results on a subset
of ILSVRC 2014 Val2 (same data as table 1).


Independent sigmoid calibration. As a baseline we compare against the
standard technique of Malisiewicz et al. [1]. It operates in two steps.
In step (1) it runs each E-SVM detector separately on a validation set,
applies non-maximum suppression, and then eliminates all detections
scoring below the $-1$ margin. All remaining detections are considered
positives if they belong to $\mathcal{P}$, or negatives if they belong
to $\mathcal{N}$. Note how this arbitrarily defines which positive
training samples to associate with a certain E-SVM. In step (2), it
then fits a logistic sigmoid to these data samples.
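Step (2) can be illustrated with a small sigmoid fit. The sketch below minimises the logistic loss with plain gradient descent; the optimiser, its step size, the parametrisation $p = 1 / (1 + e^{-(a s + b)})$ and the toy data are our assumptions, and the procedure of [1] or of Platt scaling [16] may differ in its details.

```python
import math


def fit_sigmoid(scores, labels, lr=0.1, steps=2000):
    """Fit p = 1 / (1 + exp(-(a * s + b))) to (score, label) pairs."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(a * s + b)))
            grad_a += (p - y) * s      # gradient of the logistic loss w.r.t. a
            grad_b += (p - y)          # gradient of the logistic loss w.r.t. b
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def calibrated(score, a, b):
    """Map a raw E-SVM score to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-(a * score + b)))


# Toy, non-separable data: raw scores with 1 = positive and 0 = negative.
scores = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
labels = [0, 0, 1, 0, 1, 1]
a, b = fit_sigmoid(scores, labels)
```

Once every E-SVM has its own pair `(a, b)`, the calibrated outputs lie on a common scale, which is what makes taking the max over E-SVMs meaningful.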

Our joint calibration. Our joint calibration also operates in two
steps. In step (1), instead of arbitrarily defining the positive
training samples, our technique uses the thresholds found by our
algorithm (sec. 4) to associate positive samples to E-SVMs, which is
the core underlying problem at the heart of such calibration. More
specifically, for an E-SVM $e_j$, we consider as positives all windows
$x \in \mathcal{P} : w_j \cdot x > \theta_j$, and as negatives all
windows in $\mathcal{N}$. Step (2) of our procedure is then the same as
in [1], but thanks to these optimal assignments, we fit better
sigmoids.

We experimentally evaluate performance after each step. We refer to
the output of step (1) as joint calibration, and to the output of
step (2) as joint calibration with sigmoid. As step (1) fits
thresholds, it results in binary classification of test windows, while
step (2) produces a continuous score which can be used for later
processing stages (e.g. non-maximum suppression for object detection).

Other independent calibration techniques. For completeness, we also
compare against two independent calibration techniques not commonly
used for EE-SVM: isotonic regression [20] and the recent affine
calibration [12]. Isotonic regression fits a piecewise-constant
non-decreasing function to the output of each E-SVM. We used the code
of [30] to train the function parameters on $\mathcal{P}$ and
$\mathcal{N}$. Affine calibration fits an affine transformation to the
output of each E-SVM. As in [12], we train the affine parameters on
200k randomly sampled negative windows from $\mathcal{N}$.

Single Linear-SVM (R-CNN). Finally, we provide results for the
state-of-the-art R-CNN object detection model [13]. The sole purpose is
to compare performance to EE-SVM on CNN features, as previous EE-SVM
works typically use weaker HOG features [1-7, 9, 11]. However, as it
consists of a single linear SVM per class, R-CNN cannot associate
training examples to objects detected in test images. Hence, it is not
suitable for annotation transfer. We trained the model using the code
and parameters of [13]. Note how this uses the same object proposals
and features as our EE-SVM models.

5.3. Globally optimal joint calibration 

We evaluate here our globally optimal joint calibration technique on
ILSVRC2014 [14]. We train E-SVMs on val1 for the 10 classes listed in
sec. 5.1. Each class has between 30 and 140 E-SVMs and between 500 and
3600 positive windows $\mathcal{P}$. We evaluate two tasks: window
classification and object detection.

Window classification 

For this experiment, we use all positive windows (IoU > 0.5) in the
test set val2 and 1 million randomly sampled negative ones
(IoU < 0.5). We evaluate window classification in terms of two
measures: false positives at test recall, and average precision.

Figure 7: Association between detected objects and training exemplars.
Our globally optimal joint calibration is good at transferring
annotations from exemplars onto test windows. In this figure we show
detections (green) and their associated training exemplar superimposed
on them (yellow).

ILSVRC 2014 - trained on Val1           Airplane  Bagel     Baseball  Bear      Butterfly Koala     Ladle     Printer   Sheep     Violin    mAP
EE-SVM indep. sigmoid calibration [1]   43.3      11.9      27.2      45.2      51.1      46.3      0.6       8.4       31.4      7.4       27.3
EE-SVM joint calibration w/ sigmoid     46.2      10.1      41.3      44.7      66.8      41.4      1.0       10.8      34.3      9.5       30.6
EE-SVM indep. isotonic regression [20]  34.9      13.0      31.4      36.1      59.2      48.6      0.6       13.0      31.0      3.8       27.2
EE-SVM indep. affine calibration [12]   45.8      10.3      41.0      44.1      66.1      39.5      0.8       11.4      35.9      9.6       30.5
Single Linear-SVM (R-CNN) [13]          49.0      17.4      45.4      53.3      69.6      61.4      2.8       18.9      41.5      11.0      37.0

Table 3: Object detection - Average precision. Results on ILSVRC 2014
Val1.

False positives at test recall. This measure counts how many false
positives are produced on the test set, at the recall point produced by
the ensemble of E-SVMs calibrated by our method. Note this is exactly
what our calibration procedure optimizes for. Given the thresholds
$\Theta$ output by our algorithm (sec. 4), we compute recall as the
percentage of positive windows scored positively by the ensemble on the
test set (top row of table 1). Interestingly, the thresholds generalize
well to test data and lead to high recall on almost all classes.
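For a method that outputs a continuous score, the number of false positives at a given recall can be computed as sketched below. The function name, the tie-handling (negatives scoring exactly at the operating threshold count as false positives) and the toy scores are assumptions made for illustration.

```python
import math


def false_positives_at_recall(pos_scores, neg_scores, recall):
    """Count negatives scoring at least as high as the threshold needed
    to retrieve the requested fraction of positive windows."""
    needed = math.ceil(recall * len(pos_scores))
    if needed == 0:
        return 0
    # Score of the lowest-ranked positive that must still be retrieved.
    threshold = sorted(pos_scores, reverse=True)[needed - 1]
    return sum(1 for s in neg_scores if s >= threshold)


pos = [0.9, 0.8, 0.4, 0.1]
neg = [0.95, 0.5, 0.3, 0.05]
at_75 = false_positives_at_recall(pos, neg, 0.75)    # threshold 0.4
at_100 = false_positives_at_recall(pos, neg, 1.0)    # threshold 0.1
```

Fixing the recall to the value reached by the jointly calibrated ensemble puts all methods at the same operating point, so their false-positive counts are directly comparable.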

We compare several methods at this recall point. The main four are
EE-SVM with no calibration, EE-SVM with independent sigmoid
calibration [1], our joint calibration fitting thresholds and our joint
calibration with sigmoid (table 1, rows 2-5). As expected, EE-SVM with
no calibration performs very poorly and some form of calibration is
necessary. Our joint calibration method considerably outperforms
independent calibration, and the version with sigmoid brings another
small boost in performance. These results demonstrate the benefit of
our joint calibration, which takes into account the max operation of
the EE-SVM. Given these results, we omit EE-SVM with no calibration
from further analysis. Table 1 also presents results for isotonic
regression, affine calibration and R-CNN. On average, our joint
calibration outperforms all these methods, albeit by a smaller margin.

Average precision. In table 2 we compare techniques in 
terms of average precision. As this measure requires a 
continuous score for test windows, we only consider our joint 
calibration with sigmoid. Joint calibration outperforms 
independent sigmoid calibration on all classes, and improves 
mAP by 2.3%. This further highlights the benefits of joint 
calibration, in a scenario that is not exactly what it was 
optimized for. Table 2 also presents results for isotonic 
regression, affine calibration and R-CNN. On average, joint 
calibration performs better than all these methods, albeit by 
a modest margin (about +1% mAP). 
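
To make the role of the continuous score concrete, the sketch below maps raw E-SVM scores through per-exemplar sigmoids, takes the max over the ensemble, and ranks windows to compute average precision. It is an illustration under stated assumptions: the sigmoid parameters `alphas` and `betas` are hypothetical placeholders for whatever a calibration procedure produces, and the AP here is the plain non-interpolated variant, which may differ from the official benchmark protocols (PASCAL VOC 2007, for instance, uses an 11-point interpolated AP).

```python
import math
from typing import Sequence


def sigmoid_calibrated_max(raw_scores: Sequence[float],
                           alphas: Sequence[float],
                           betas: Sequence[float]) -> float:
    """Ensemble score of one window: max over E-SVMs of sigmoid(alpha * score + beta)."""
    return max(1.0 / (1.0 + math.exp(-(a * s + b)))
               for s, a, b in zip(raw_scores, alphas, betas))


def average_precision(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Non-interpolated AP: mean of the precision values at the rank of each positive."""
    num_pos = sum(labels)
    if num_pos == 0:
        return 0.0
    # Rank windows by decreasing score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, ap = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            ap += hits / rank
    return ap / num_pos


# Toy example with 2 E-SVMs and 4 windows (labels: 1 = object, 0 = background).
alphas, betas = [2.0, 1.0], [0.0, -1.0]      # hypothetical sigmoid parameters
windows = [[1.0, 0.0], [-1.0, 0.5], [0.2, 2.5], [-2.0, -2.0]]
labels = [1, 0, 1, 0]

ensemble_scores = [sigmoid_calibrated_max(w, alphas, betas) for w in windows]
ap = average_precision(ensemble_scores, labels)
print(f"AP = {ap:.3f}")
```

Because the max is taken after the sigmoids, a poorly calibrated E-SVM that outputs high values on background windows pushes negatives up the ranking, which is why calibration quality affects AP directly.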


Object detection 

We evaluate our joint calibration method against independent 
sigmoid calibration on the task of object detection. 
Note that this task adds a layer of non-maximum 
suppression (NMS) to the pipeline. As our calibration 
procedure does not take NMS into account, it is not obvious 
that the benefit seen so far on window classification will 
carry over to object detection. As table 3 shows, joint 
calibration outperforms independent sigmoid calibration on 
this task as well (+3.3% mAP). Joint calibration performs 
equally or better on all classes but bagel, bear and koala. 
For some classes the improvement is substantial: +14.1% AP 
on baseball and +15.7% AP on butterfly. 
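
For readers unfamiliar with the NMS layer mentioned above, the following is a minimal sketch of the standard greedy procedure: detections are visited in order of decreasing score, and a detection is dropped if it overlaps an already kept one too much. This is a generic illustration, not the exact implementation used in the experiments; the box format `(x1, y1, x2, y2)` and the IoU threshold of 0.5 are assumptions.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed format


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes: Sequence[Box], scores: Sequence[float],
               iou_threshold: float = 0.5) -> List[int]:
    """Return indices of kept boxes, in order of decreasing score."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept: List[int] = []
    for i in order:
        # Keep box i only if it does not overlap any higher-scoring kept box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return kept


# Toy example: the second box heavily overlaps the first and is suppressed.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = greedy_nms(boxes, scores)
print(kept)
```

Since NMS compares the scores of overlapping windows, it relies on the same cross-window comparability of scores that calibration is meant to provide.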

Table 3 also presents results for isotonic regression, 
affine calibration and R-CNN. Isotonic regression 
performs comparably to independent sigmoid calibration, 
whereas affine calibration delivers about the same mAP as 
our joint calibration. Interestingly, R-CNN does considerably 
better than all other methods, including our EE-SVM 
with joint calibration. This is somewhat surprising, as 
EE-SVM was shown to perform much better than a single linear 
SVM on HOG features [1]. We attribute this phenomenon to 
the CNN features, which are more easily linearly separable 
[13, 31-33]. Note, however, that despite its high 
performance, R-CNN lacks the crucial ability of EE-SVM to 
associate training exemplars to objects detected in test 
images, and it is therefore not suitable for annotation transfer. 

5.4. Approximate joint calibration 

In this section we evaluate our approximate joint calibration 
technique (sec. 4.4). By relaxing global optimality, we 
can find a feasible solution even for large problems. 













ILSVRC 2014 - trained on val1+train

| Method | mean |
|---|---|
| Recall (%) | 85.2 |
| EE-SVM no calibration | 137k |
| EE-SVM indep. sigmoid calibration [ ] | 107k |
| EE-SVM joint calibration | 82k |
| EE-SVM joint calibration w/ sigmoid | 54k |

Table 4: Window classification - False positives at test recall. 
Results on a subset of ILSVRC 2014 val2 (mean over 10 classes). 
We used all positive windows and 2 million randomly sampled 
negative ones. 


ILSVRC 2014 - trained on val1+train

| Method | mAP |
|---|---|
| EE-SVM indep. sigmoid calibration [ ] | 39.7 |
| EE-SVM joint calibration w/ sigmoid | 42.9 |

Table 5: Window classification - Average precision. Results on 
a subset of ILSVRC 2014 val2 (mean over 10 classes, same data 
as table 4). 

ILSVRC2014. We experiment here by training and calibrating 
EE-SVM on the union of val1 and train. This 
results in a large number of E-SVMs. Each class has 
between 640 and 2000 E-SVMs, and between 3000 and 13000 
positive windows V. We report results on the task of 
window classification averaged over the 10 classes. Note that 
these results cannot be compared to the ones in tables 1 and 
2 because here we have larger training and test sets. 

False positives at test recall. As table 4 shows, our 
approximate joint calibration procedure still achieves high recall 
while returning far fewer false positives than no calibration 
and independent sigmoid calibration. Adding a 
sigmoid improves our calibration even further, by a good 
margin. This shows that our method provides an excellent 
association between positive windows and E-SVMs. 

Average precision. Results are presented in table 5. Joint 
calibration improves over independent sigmoid calibration 
by +3.2% mAP. 

PASCAL VOC 2007. In order to compare against the 
original EE-SVM of [ ], we experiment here on the PASCAL 
VOC 2007 dataset. We train and calibrate EE-SVM on 
the trainval subset and evaluate them on test. We 
report results on object detection in terms of mAP over the 20 
classes (table 6). We compare traditional EE-SVM on HOG 
features (19.8 mAP, as reported by [ ]), independently 
calibrated EE-SVM on CNNs (40.8 mAP), and our joint 
calibration on the same features (42.7 mAP). These results 
highlight two points: (1) joint calibration improves over 
independent calibration by +1.9% mAP, confirming what we 
observed on ILSVRC2014; (2) CNN features bring a huge 
improvement over HOG to EE-SVM models (doubling mAP 
in this case). This confirms recent findings [13] about the 
benefits of CNN features for object detection. 


| PASCAL VOC 2007 test | Feature | Calibration | mAP |
|---|---|---|---|
| EE-SVM | HOG | independent | 19.8 |
| EE-SVM | CNN | independent | 40.8 |
| EE-SVM | CNN | joint | 42.7 |

Table 6: Object detection - Average precision. Results on PASCAL 
VOC 2007 test (mean over 20 classes). EE-SVM HOG 
results are from [1 ]. 

5.5. Pruning statistics and runtimes 

Pruning statistics. Here we experimentally evaluate how 
effective our pruning techniques of sec. 4.3 are. 
Observation 3 (line 2, Algo. 1) reduces the depth of the tree by 
20%, on average. Observation 4 (line 3) improves pruning 
by bound immensely. In a small problem with 100 E-SVMs 
we tried ordering the positive windows V randomly: the 
algorithm took two hours to find the global solution. On the 
other hand, when sorting V according to observation 4, the 
algorithm found the same solution in about 2 minutes. 
Observations 1 and 2 also bring a substantial speed-up. After 
finding a first feasible solution (line 1), the algorithm (lines 
8, 9) prunes 40% of the nodes it visits on problems with 100 
E-SVMs, and 70% on problems with 1000 E-SVMs. 

Runtime. We measure runtimes on a 4-core Intel Core 
i5 at 2.0 GHz. Exhaustive search is extremely inefficient and 
takes 15h to find the globally optimal solution for a tiny 
problem with 15 E-SVMs and 50 positive windows. Our 
efficient and exact algorithm (sec. 4.3) finds the same 
solution in just a few seconds. This algorithm scales up to 
problems with about 200 E-SVMs and 4k positive windows 
in reasonable time (a few hours). For larger problems we 
rely on our approximate search algorithm (sec. 4.4). While 
we let it run for several hours, in most cases the loss stops 
decreasing significantly after only a few minutes. 
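
The gap between exhaustive search and a search with pruning by bound can be illustrated on a toy problem. The sketch below is a generic depth-first branch-and-bound on a small assignment problem; it is not Algorithm 1 and does not implement observations 1-4, and the cost matrix and function names are hypothetical. It only shows the mechanism: once a first feasible solution is found, any node whose optimistic bound cannot beat it is discarded, and visiting promising branches first makes that happen sooner.

```python
import itertools
from typing import Dict, List, Tuple


def exhaustive_min(cost: List[List[float]]) -> float:
    """Try every assignment of rows to distinct columns and return the minimum cost."""
    n = len(cost)
    return min(sum(cost[r][c] for r, c in enumerate(perm))
               for perm in itertools.permutations(range(n)))


def branch_and_bound(cost: List[List[float]]) -> Tuple[float, Dict[str, int]]:
    """Depth-first search over rows with pruning by bound."""
    n = len(cost)
    # remaining_lb[r]: sum of row minima for rows r..n-1, an optimistic
    # (never over-estimating) bound on the cost still to be paid.
    remaining_lb = [0.0] * (n + 1)
    for r in range(n - 1, -1, -1):
        remaining_lb[r] = remaining_lb[r + 1] + min(cost[r])

    best = [float("inf")]
    stats = {"visited": 0, "pruned": 0}

    def visit(row: int, used: frozenset, partial: float) -> None:
        stats["visited"] += 1
        # Prune: even the most optimistic completion cannot beat the incumbent.
        if partial + remaining_lb[row] >= best[0]:
            stats["pruned"] += 1
            return
        if row == n:
            best[0] = partial  # new incumbent (strictly better, by the test above)
            return
        # Ordering heuristic: try cheap columns first to find good solutions early.
        for col in sorted(range(n), key=lambda c: cost[row][c]):
            if col not in used:
                visit(row + 1, used | {col}, partial + cost[row][col])

    visit(0, frozenset(), 0.0)
    return best[0], stats


cost = [[4, 1, 3, 2],
        [2, 0, 5, 3],
        [3, 2, 2, 1],
        [1, 3, 4, 2]]
best_cost, stats = branch_and_bound(cost)
print(best_cost, exhaustive_min(cost), stats)
```

Both searches return the same optimum, but the bounded search skips subtrees entirely. On larger problems the fraction of skipped nodes grows, which is consistent in spirit with the pruning rates reported above.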

6. Conclusion 

We presented a method for calibrating the Ensemble of 
Exemplar SVMs model. While the standard approach 
calibrates each SVM independently, our method optimizes their 
joint performance as an ensemble. We formulated joint 
calibration as a constrained optimization problem and devised 
an efficient optimization algorithm to find its global 
optimum. In order to make the optimization computationally 
feasible, the algorithm dynamically discards parts of the 
solution space that cannot contain the optimum, by exploiting 
four observations about the structure of the problem. 

We presented experiments on 10 classes from the 
ILSVRC 2014 dataset and 20 from PASCAL VOC 2007. 
Our joint calibration procedure outperforms the classic 
independent sigmoid calibration by a considerable margin on 
the task of classifying windows as belonging to an object 
class or not. On object detection, this better window 
classifier leads to an improvement of about 3% mAP. 